
    Generating Literal and Implied Subquestions to Fact-check Complex Claims

    Verifying complex political claims is a challenging task, especially when politicians use various tactics to subtly misrepresent the facts. Automatic fact-checking systems fall short here, and their predictions, such as "half-true", are not very useful in isolation, since we have no idea which parts of the claim are true and which are not. In this work, we focus on decomposing a complex claim into a comprehensive set of yes-no subquestions whose answers influence the veracity of the claim. We present ClaimDecomp, a dataset of decompositions for over 1000 claims. Given a claim and its verification paragraph written by fact-checkers, our trained annotators write subquestions covering both explicit propositions of the original claim and its implicit facets, such as asking about additional political context that changes our view of the claim's veracity. We study whether state-of-the-art models can generate such subquestions, showing that these models generate reasonable questions to ask, but predicting the comprehensive set of subquestions from the original claim alone, without evidence, remains challenging. We further show that these subquestions can help identify relevant evidence to fact-check the full claim and derive the veracity through their answers, suggesting that they can be useful pieces of a fact-checking pipeline.
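
    As an illustration of the decomposition step described above, here is a minimal Python sketch assuming a generic LLM completion callback; the prompt wording and the output parsing are hypothetical, not the paper's actual setup:

        from typing import Callable, List

        PROMPT = (
            "Decompose the following political claim into yes-no subquestions "
            "whose answers bear on its veracity. Cover both the explicit "
            "propositions of the claim and its implicit facets, such as "
            "missing political context.\n"
            "Claim: {claim}\n"
            "Subquestions (one per line):"
        )

        def decompose_claim(claim: str, complete: Callable[[str], str]) -> List[str]:
            """Prompt an LLM for subquestions; keep lines that end in '?'."""
            raw = complete(PROMPT.format(claim=claim))
            return [ln.strip(" -\t") for ln in raw.splitlines()
                    if ln.strip().endswith("?")]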

    Using Natural Language Explanations to Rescale Human Judgments

    The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over crowdworker judgments. However, annotators' judgments for subjective tasks can differ in many ways: they may have different qualitative judgments about an example, and they may map those judgments to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators' underlying assessments of the example. The rubric can be designed or modified after annotation and can include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric. Comment: Data available at https://github.com/ManyaWadhwa/explanation_based_rescaling
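
    The rescaling step could look roughly like the following sketch, again assuming a generic LLM completion callback; the prompt and the 0-100 scale are illustrative assumptions, not the paper's exact design:

        from typing import Callable

        def rescale_judgment(rating: int, explanation: str, rubric: str,
                             complete: Callable[[str], str]) -> float:
            """Map a Likert rating plus its free-text explanation onto a
            rubric-anchored numeric score via an LLM."""
            prompt = (
                f"Scoring rubric:\n{rubric}\n\n"
                f"An annotator gave a rating of {rating} with this "
                f"explanation:\n{explanation}\n\n"
                "On a 0-100 scale anchored in the rubric above, which score "
                "best reflects the annotator's underlying assessment? "
                "Answer with a number only."
            )
            return float(complete(prompt).strip())

    Because the score is anchored in the rubric rather than in each annotator's personal scale, ratings from different annotators become directly comparable after rescaling.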

    How to Evaluate Semantic Communications for Images with ViTScore Metric?

    Semantic communications (SC) are expected to be a new paradigm that catalyzes next-generation communication, shifting the main concern from accurate bit transmission to effective exchange of semantic information. However, the metrics previously and widely used for images are not applicable to evaluating image semantic similarity in SC. Classical metrics for measuring the similarity between two images, such as PSNR and MS-SSIM, usually operate at the pixel or structural level. Straightforwardly applying tailored deep-learning metrics from the CV community, such as LPIPS, is likewise infeasible for SC. To tackle this, inspired by BERTScore from the NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has three important properties, namely symmetry, boundedness, and normalization, which make ViTScore convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare it with three typical metrics (PSNR, MS-SSIM, and LPIPS) across five classes of experiments. Experimental results demonstrate that ViTScore evaluates image semantic similarity better than the other three metrics, indicating that ViTScore is an effective performance metric when deployed in SC scenarios.
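
    The BERTScore-style construction suggests roughly the following computation: embed each image into a set of ViT patch embeddings, match patches across the two images by cosine similarity in both directions, and combine the two directions into an F1. Below is a minimal sketch assuming an upstream step that yields L2-normalized patch embeddings; the paper's exact definition of ViTScore may differ in detail:

        import torch

        def vitscore(x: torch.Tensor, y: torch.Tensor) -> float:
            """BERTScore-style F1 over two images' ViT patch embeddings.
            x, y: (num_patches, dim) tensors with L2-normalized rows."""
            sim = x @ y.T                              # pairwise cosine similarity
            recall = sim.max(dim=1).values.mean()      # best match for each x patch
            precision = sim.max(dim=0).values.mean()   # best match for each y patch
            return (2 * precision * recall / (precision + recall)).item()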

    Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction

    Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique consisting of sentences with structured knowledge of the same meaning but different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate over the whole clique. We perform experiments on typical models published in the last decade as well as on a popular large language model; the results show that existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 in F1 score. Our resources and code are available at https://github.com/qijimrc/ROBUST. Comment: Accepted by EMNLP 2023 Main Conference
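
    One natural instantiation of "consistently accurate over the whole clique" is worst-case aggregation: score each clique by its weakest member, then average across cliques. The sketch below is a hypothetical simplification for illustration, not necessarily the paper's exact metric:

        from typing import Dict, List

        def clique_robust_score(f1: Dict[str, float],
                                cliques: List[List[str]]) -> float:
            """Average, over cliques, of each clique's worst-member F1, so a
            model scores well only if it is accurate on every paraphrase of
            the same underlying knowledge."""
            worst = [min(f1[s] for s in clique) for clique in cliques]
            return sum(worst) / len(worst)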

    VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering

    We present the Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates humans into the loop to edit and debug knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into the knowledge oriented programming language (KoPL), but also maps KoPL programs onto graphical elements. KoPL programs can be edited with simple graphical operators, such as dragging to add knowledge operators and slot filling to designate operator arguments. Moreover, VisKoP provides auto-completion for the knowledge base schema, and users can easily debug a KoPL program by checking its intermediate results. To facilitate practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back end. Experimental results show that VisKoP is highly efficient and that user interaction can fix a large portion of wrong KoPL programs to arrive at the correct answer. The VisKoP online demo (https://demoviskop.xlore.cn, stable release of this paper; https://viskop.xlore.cn, beta release with new features), the highly efficient KoPL engine (https://pypi.org/project/kopl-engine), and a screencast video (https://youtu.be/zAbJtxFPTXo) are now publicly available.
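
    To make the KoPL idea concrete, here is a toy interpreter that executes a linear sequence of knowledge operators and records every intermediate result, which is the property that enables the step-by-step debugging described above. The operator names and mini-KB are hypothetical and do not reflect the real kopl-engine API:

        # Hypothetical mini-KB and operators, for illustration only.
        KB = {"Canada": {"capital": "Ottawa"}, "France": {"capital": "Paris"}}

        def run_kopl(program):
            """Execute [(operator, argument), ...] left to right, keeping a
            trace of intermediate results for inspection."""
            result, trace = None, []
            for op, arg in program:
                if op == "Find":
                    result = KB[arg]          # locate an entity
                elif op == "QueryAttr":
                    result = result[arg]      # read one attribute of it
                trace.append((op, arg, result))
            return result, trace

        answer, steps = run_kopl([("Find", "France"), ("QueryAttr", "capital")])
        # answer == "Paris"; `steps` exposes each intermediate result,
        # which is what a user would inspect when debugging a wrong program.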

    Annual report 1984-1985

    BACKGROUND: HOTAIR, a newly discovered long intergenic noncoding RNA (lincRNA), has been reported to be aberrantly expressed in many types of cancers. This meta-analysis summarizes its potential role as a biomarker in malignancy. METHODS: A quantitative meta-analysis was performed through a systematic search in PubMed, MEDLINE and Web of Science for eligible papers on the prognostic impact of HOTAIR in cancer from inception to Feb. 28, 2014. Pooled hazard ratios (HRs) with 95% confidence intervals (95% CIs) were calculated to summarize the effect. RESULTS: Nineteen studies were included, with a total of 2033 patients. A significant association was observed between high HOTAIR expression and poor overall survival (OS) in patients with cancer (pooled HR 2.22, 95% CI: 1.68-2.93). Place of residence (Asian or Western countries), type of cancer (digestive or non-digestive disease), sample size (more or less than 100), and paper quality (score more or less than 85%) did not alter the significant predictive value of HOTAIR for OS in various kinds of cancer, but preoperative status did. By combining HRs from Cox multivariate analyses, we found that HOTAIR expression was an independent prognostic factor for cancer patients (pooled HR 2.26, 95% CI: 1.62-3.15). Subgroup analysis showed that HOTAIR abundance was an independent prognostic factor for cancer metastasis (HR 3.90, 95% CI: 2.25-6.74). For esophageal carcinoma, high HOTAIR expression was significantly associated with TNM stage (III/IV vs. I/II: OR 6.90, 95% CI: 2.81-16.9) without heterogeneity. In gastric cancer, HOTAIR expression was found to be significantly associated with lymph node metastases (present vs. absent: OR 4.47, 95% CI: 1.88-10.63) and vessel invasion (positive vs. negative: OR 2.88, 95% CI: 1.38-6.04) without obvious heterogeneity. CONCLUSIONS: HOTAIR abundance may serve as a novel predictive factor for poor prognosis in different types of cancers in both Asian and Western countries.
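
    Pooled hazard ratios of this kind are conventionally obtained by inverse-variance weighting on the log scale. The sketch below shows the standard fixed-effect computation, recovering each study's standard error from its reported 95% CI; it is illustrative of the general technique, not the authors' exact procedure:

        import math

        def pooled_hr(studies):
            """Fixed-effect inverse-variance pooling of hazard ratios.
            studies: [(hr, ci_low, ci_high), ...]. The SE of each log-HR is
            recovered from the 95% CI width: (ln(hi) - ln(lo)) / (2 * 1.96)."""
            num = den = 0.0
            for hr, lo, hi in studies:
                se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
                w = 1.0 / se ** 2                  # inverse-variance weight
                num += w * math.log(hr)
                den += w
            log_hr, se_pooled = num / den, math.sqrt(1.0 / den)
            ci = (math.exp(log_hr - 1.96 * se_pooled),
                  math.exp(log_hr + 1.96 * se_pooled))
            return math.exp(log_hr), ci

        # Example with made-up inputs: pooled_hr([(2.1, 1.5, 2.9), (2.4, 1.6, 3.6)])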